Automatically Extracting Variant-Normalization Pairs for Japanese Text Normalization

Authors

  • Itsumi Saito
  • Kyosuke Nishida
  • Kugatsu Sadamitsu
  • Kuniko Saito
  • Junji Tomita

Abstract

Social media texts, such as tweets from Twitter, contain many types of nonstandard tokens, and the number of normalization approaches for handling such noisy text has been increasing. We present a method for automatically extracting pairs of a variant word and its normal form from unsegmented text on the basis of a pair-wise similarity approach. We incorporated the acquired variant-normalization pairs into Japanese morphological analysis. The experimental results show that our method can extract widely covered variants from large Twitter data and improve the recall of normalization without degrading the overall accuracy of Japanese morphological analysis.
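
The abstract does not describe the similarity measure or the extraction procedure in detail, so the following is only an illustrative sketch of what a pair-wise similarity approach to variant extraction might look like. Everything in it is an assumption made for illustration: the function names `similarity` and `extract_pairs`, the use of Python's `difflib` ratio as the similarity measure, the 0.7 threshold, and the heuristic that the more frequent string is the normal form. It is not the authors' method.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1].

    Assumption: difflib's ratio is a stand-in; the paper's measure is not
    given in the abstract.
    """
    return SequenceMatcher(None, a, b).ratio()


def extract_pairs(counts: dict[str, int], threshold: float = 0.7) -> list[tuple[str, str]]:
    """Return (variant, normal_form) pairs from candidate strings.

    `counts` maps each candidate string to its corpus frequency.
    Assumption: within a similar pair, the more frequent string is
    treated as the normal form.
    """
    pairs = []
    # Compare every unordered pair of candidates once.
    for a, b in combinations(sorted(counts), 2):
        if similarity(a, b) >= threshold:
            normal, variant = (a, b) if counts[a] >= counts[b] else (b, a)
            pairs.append((variant, normal))
    return pairs


if __name__ == "__main__":
    # Hypothetical frequencies, not data from the paper.
    toy_counts = {"すごい": 100, "すごーい": 5, "ねこ": 50}
    print(extract_pairs(toy_counts))
```

Comparing all candidate pairs is quadratic in the number of candidates, so a sketch like this would need some form of candidate pruning before it could be applied to large Twitter data.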

Similar resources

Japanese Text Normalization with Encoder-Decoder Model

Text normalization is the task of transforming lexical variants to their canonical forms. We model the problem of text normalization as a character-level sequence to sequence learning problem and present a neural encoder-decoder model for solving it. To train the encoder-decoder model, many sentence pairs are generally required. However, Japanese non-standard canonical pairs are scarce in the ...

Automatic paraphrasing based on parallel corpus for normalization

There are various ways to express the same meaning in natural language. This diversity causes difficulty in many fields of natural language processing. It can be reduced by normalization of synonymous expressions, which is done by replacing various synonymous expressions with a standard one. In this paper, we propose a method for extracting paraphrases from a parallel corpus automatica...

Morphological Analysis for Japanese Noisy Text based on Character-level and Word-level Normalization

Social media texts are often written in a non-standard style and include many lexical variants, such as insertions, phonetic substitutions, and abbreviations that mimic spoken language. The normalization of such a variety of non-standard tokens is one promising solution for handling noisy text. A normalization task is very difficult to conduct in Japanese morphological analysis because there are no ...

Improving Text Normalization via Unsupervised Model and Discriminative Reranking

Various models have been developed for normalizing informal text. In this paper, we propose two methods to improve normalization performance. First is an unsupervised approach that automatically identifies pairs of a non-standard token and proper word from a large unlabeled corpus. We use semantic similarity based on continuous word vector representation, together with other surface similarity ...

A Log-Linear Model for Unsupervised Text Normalization

We present a unified unsupervised statistical model for text normalization. The relationship between standard and non-standard tokens is characterized by a log-linear model, permitting arbitrary features. The weights of these features are trained in a maximum-likelihood framework, employing a novel sequential Monte Carlo training algorithm to overcome the large label space, which would be imprac...

Publication date: 2017